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Abstract. The Parikh vector p(s) of a string s over a finite ordered alphabet 
E = {ai, . . . ,aCT} is defined as the vector of multiplicities of the characters, 
p{s) = (pi, . . . ,pcr), where pi — \{j \ sj — ai}\. Parikh vector q occurs in s if s 
has a substring t with p{t) = q. The problem of searching for a query g in a 
text s of length n can be solved simply and worst-case optimally with a sliding 
window approach in 0(n) time. We present two novel algorithms for the case 
where the text is fixed and many queries arrive over time. 
The first algorithm only decides whether a given Parikh vector appears in a 
binary text. It uses a linear size data structure and decides each query in 0(1) 
time. The preprocessing can be done trivially in 0{n'^) time. 
The second algorithm finds all occurrences of a given Parikh vector in a text 
over an arbitrary alphabet of size cr > 2 and has sub-linear expected time 
complexity. More precisely, we present two variants of the algorithm, both 
using an 0(n) size data structure, each of which can be constructed in 0{n) 
time. The first solution is very simple and easy to implement and leads to am 
expected query time of 0{n{^^^y^^ '"^a^ )' where m = qi is the length of 
a string with Parikh vector q. The second uses wavelet trees and improves the 
expected runtime to 0{n{Y^^)^^^ ^J^), i.e., by a factor of logm. 

Keywords: Parikh vectors, permuted strings, pattern matching, string algorithms, 
average case analysis, text indexing, non-standard string matching 
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1 Introduction 

Parikh vectors of strings count the multiplicity of the characters. They have been 
reintroduced many times by different names (compomer [5] , composition [3] , Parikh 
vector [22], permuted string [7], permuted pattern [10], and others). They are natural 
objects to study, due to their numerous applications; for instance, in computational 
biology, they have been applied in alignment algorithms [3], SNP discovery [5], re- 
peated pattern discovery [10], and, most naturally, in interpretation of mass spec- 
trometry data [4] . Parikh vectors can be seen as a generalization of strings, where we 
view two strings as equivalent if one can be turned into the other by permuting its 
characters; in other words, if the two strings have the same Parikh vector. 

The problem we are interested in here is answering the question whether a query 
Parikh vector q appears in a given text s (decision version), or where it occurs (oc- 
currence version). An occurrence of q is defined as an occnu'rence of a substring t 
of s with Parikh vector q. The problem can be viewed as an approximate pattern 
matching problem: We are looking for an occurrence of a jumbled version of a query 
string t, i.e. for the occurrence of a substring f which has the same Parikh vector. In 
the following, let n be the length of the text s, m the length of the query q (defined 
as the length of a string t with Parikh vector q), and a the size of the alphabet. 

The above problem (both decision and occurrence versions) can be solved with 
a simple sliding window based algorithm, in 0{n) time and 0{a) additional storage 
space. This is worst case optimal with respect to the case of one query. However, when 
we expect to search for many queries in the same string, the above approach leads 
to 0{Kn) runtime for K queries. To the best of our knowledge, no faster approach 
is known. This is in stark contrast to the classical exact pattern matching problem, 
where all exact occurrences of a query pattern of length m are sought in a text of 
length n. In that case, for one query, any naive approach leads to 0{nm) runtime, 
while quite involved ideas for preprocessing and searching are necessary to achieve an 
improved runtime of 0{n + m), as do the Knuth-Morris-Pratt [17], Boyer-Moore [6] 
and Boyer-Moore-type algorithms (see, e.g., [2, 14]). However, when many queries are 
expected, the text can be preprocessed to produce a data structure of size linear in 
n, such as a suffix tree, suffix array, or suffix automaton, which then allows to answer 
individual queries in time linear in the length of the pattern (see any textbook on 
string algorithms, e.g. [23, 18]). 

1.1 Related work 

Jumbled pattern matching is a special case of approximate pattern matching. It has 
been used as a filtering step in approximate pattern matching algorithms [15], but 
rarely considered in its own right. 

The authors of [7] present an algorithm for finding all occurrences of a Parikh 
vector in a runlength encoded text. The algorithm's time complexity is 0{n' + cr), 
where n' is the length of the runlength encoding of s. Obviously, if the string is not 
runlength encoded, a preprocessing phase of time 0{n) has to be added. However, 
this may still be feasible if many queries are expected. To the best of our knowledge, 
this is the only algorithm that has been presented for the problem we treat here. 
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An efficient algorithm for computing all Parikh fingerprints of substrings of a given 
string was developed in [1]. Parikh fingerprints are Boolean vectors where the fc'th 
entry is 1 if and only if Uk appears in the string. The algorithm involves storing a 
data point for each Parikh fingerprint, of which there are at most 0{na) many. This 
approach was adapted in [10] for Parikh vectors and applied to identifying all repeated 
Parikh vectors within a given length range; using it to search for queries of arbitrary 
length would imply using Q{P{s)) space, where P{s) denotes the number of diflferent 
Parikh vectors of substrings of s. This is not desirable, since, for arbitrary alphabets, 
there are non-trivial strings of any length with quadratic P{s) [8]. 

1.2 Results 

In this paper, we present two novel algorithms which perform significantly better than 
the simple window algorithm, in the case where many queries arrive. 

For the binary case, we present an algorithm which answers decision queries in 
0(1) time, using a data structure of size 0{n) (Interval Algorithm, Sect. 3). The data 
structure is constructed in 0{n^) time. 

For general alphabets, we present an algorithm with expected sublinear runtime 
which uses 0{n) space to answer occurrence queries (Jumping Algorithm, Sect. 4). 
We present two different variants of the algorithm. The first one uses a very simple 
data structure (an inverted table) and answers queries in time 0(CTJlog(j + m)), 
where J denotes the number of iterations of the main loop of the algorithm. We then 
show that the expected value of J for the case of random strings and patterns is 
0(^=-^==), yielding an expected runtime of 0{n{j^^)^/^^-^^^), per query 

The second variant of the algorithm uses wavelet trees [13] and has query time 
0{aJ), yielding an overall expected runtime of 0(n(j^^)^/^^^), per query. (Here 
and in the following, log stands for logarithm base 2.) 

Our simulations on random strings and real biological strings confirm the sublinear 
behavior of the algorithms in practice. This is a significant improvement over the 
simple window algorithm w.r.t. expected runtime, both for a single query and repeated 
queries over one string. 

The Jumping Algorithm is reminiscent of the Boyer-Moore-like approaches to 
the classical exact string matching problem [6,2,14]. This analogy is used both in 
its presentation and in the analysis of the number of iterations performed by the 
algorithm. 

2 Definitions and problem statement 

Given a finite ordered alphabet S = {oi, . . . , a^}, ai < . . . < a^- For a string s € S* , 
s = si . . . s„, the Parikh vector p{s) = (pi, . . . ,p„) of s defines the multiplicities of 
the characters in s, i.e. Pi = \{j \ Sj = aj}|, for « = 1, . . . , a. For a Parikh vector p, 
the length \p\ denotes the length of a string with Parikh vector p, i.e. \p\ = J2iPi- 
occurrence of a Parikh vector p in s is an occurrence of a substring t with p(t) = p. 
(An occurrence of t is a pair of positions < i < j < n, such that Si . . . sj = t.) 
A Parikh vector that occurs in s is called a sub-Parikh vector of s. The prefix of 
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length i is denoted pr{i) = pr{i,s) = si...Sj, and the Parikh vector of pr{i) as 
prv{i) = prv{i, s) = p(pr(i)). 

For two Parikh vectors p,q £ N"' , we define p < q and p + q component- wise: 
p < q ii and only if p, < for alH = 1, . . . , a, and p + q = u where Ui = Pi + qi for 
i = 1, . . . ,a. Similarly, for p < q, we set q — p = v where Vi = qi — Pi for i = 1, . . . , a. 

Jumbled Pattern Matching (JPM). Let s G S* he given, \s\ = n. For 

a Parikh vector q G N°' (the query), \q\ = to, find all occurrences of q in s. 
The decision version of the problem is where we only want to know whether 
q occurs in s. 

We assume that K many queries arrive over time, so some preprocessing may be 
worthwhile. 

Note that for X = 1, both the decision version and the occurrence version can 
be solved worst-case optimally with a simple window algorithm, which moves a fixed 
size window of size m along string s. Maintain the Parikh vector c of the current 
window and a counter r which counts indices i such that Cj ^ qi- Each sliding step 
costs either or 2 update operations of c, and possibly one increment or decrement of 
r. This algorithm solves both the decision and occurrence problems and has running 
time 0{n), using additional storage space 0{(t). 

Precomputing, sorting, and storing all sub-Parikh vectors of s would lead to ©(n^) 
storage space, since there are non- trivial strings with a quadratic number of Parikh 
vectors over arbitrary alphabets [8]. Such space usage is inacceptable in many appli- 
cations. 

For small queries, the problem can be solved exhaustively with a linear size index- 
ing structure such as a suffix tree, which can be searched down to length m ~ \q\ (of 
the substrings), yielding a solution to the decision problem in time 0{a"^). For finding 
occurrences, report all leaves in the subtrees below each match; this costs 0{M) time, 
where M is the number of occurrences of q in s. Constructing the suffix tree takes 
0{n) time, so for m = o(logn), we get a total runtime of 0(n), since M < n for any 
query q. 

3 Decision problem in the binciry case 

In this section, we present an algorithm for strings over a binary alphabet which, 
once a data structure of size 0{n) has been constructed, answers decision queries in 
constant time. It makes use of the following nice property of binary strings. 

Lemma 1. Let s € {a,b}* with \s\ = n. Fix I < m < n. If the Parikh vectors 
{xi,m — xi) and {x2,m — X2) both occur in s, then so does {y,m — y) for any xi < 

y < X2- 

Proof. Consider a sliding window of fixed size m moving along the string and let 
{pi,P2) be the Parikh vector of the current substring. When the window is shifted by 
one, the Parikh vector either remains unchanged (if the character falling out is the 
same as the character coming in), or it becomes {pi + I,p2 — 1) resp. {pi — l,p2 + 1) (if 
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they are different). Thus the Parikh vectors of substrings of s of length m build a set 
of the form {{x,m ~ x) \ x = pniin(m), pmin(m) + 1, . . . , pmax(m)} for appropriate 
pmin(m) and pmax(m). □ 

Assume that the algorithm has access to the values pmin(m) and pmax(m) for 
m = 1, . . . ,n; then, when a query q = (x,y) arrives, it answers yes if and only if 
X e [pmin(a; + y),pmax(a; + y)]. The query time is 0{1). 

The tabic of the values pmin(m) and pmax(m) can be easily computed in a pre- 
processing step in time 0{n^) by scanning the string with a window of size m, for 
each m. Alternatively, lazy computation of the table is feasible, since for any query 
q, only the entry m = \q\ is necessary. Therefore, it can be computed on the fly as 
queries arrive. Then, any query will take time 0(1) (if the appropriate entry has 
already been computed), or 0{n) (if it has not). After n queries of the latter kind, 
the table is completed, and all subsequent queries can be answered in 0(1) time. If 
we assume that the query lengths are uniformly distributed, then this can be viewed 
as a coupon collector problem where the coupon collector has to collect one copy of 
each length m. Then the expected number of queries needed before having seen all n 
coupons is nHn !v nlnn (see e.g. [11]). The algorithm will have taken 0{n^) time to 
answer these nlnn queries. 

The assumption of the uniform length distribution may not be very realistic; how- 
ever, even if it does not hold, we never take more time than 0{n^ + K) for K many 
queries. Since any one query may take at most 0{n) time, our algorithm never per- 
forms worse than the simple window algorithm. Moreover, for those queries where 
the table entries have to be computed, we can even run the simple window algorithm 
itself and report all occurrences, as well. For all others, we only give decision answers, 
but in constant time. 

The size of the data structure is 2n = 0{n). The overall running time for either 
variant is 0(K + n^). As soon as the number of queries is if = w(n), both variants 
outperform the simple window algorithm, whose running time is 0{Kn). 

Example 1. Let s = ababbaabaabbbaaabbab. In Table 1, we give the table of pmin and 
pmax for s. This example shows that the locality of pmin and pmax is preserved only 
in adjacent levels. As an example, the value pmax(3) = 3 corresponds to the substring 
aaa appearing only at position 14, while pmax(5) = 4 corresponds to the substring 
aabaa appearing only at position 6. 

Table 1. An example of the linear data structure for answering queries in constant time. 

m 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 
pmin 0001223344556778899 10 



pmax 123344455677788999 10 10 
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4 The Jumping Algorithm 

In this section, we introduce our algorithm for general alphabets. We first give the 
main algorithm and then present two different implementations of it. The first one, 
an inverted prefix table, is very easy to understand and to implement, takes 0{n) 
space and 0(n) time to construct (both with constant 1), and can replace the string. 
Then we show how to use a wavelet tree of s to implement our algorithm, which has 
the same space requirements as the inverted table, can be constructed in 0{n) time, 
and improves the query time by a factor of logm. 

4.1 Main algorithm 

Let s = Si . . . Sn & he given. Recall that prv{i) denotes the Parikh vector of the 
prefix of s of length i, for i = 0, . . . , n, where prv{Q) = p{e) = (0, . . . , 0). Consider 
Parikh vector p € N*^, p ^ (0, . . . , 0). We make the following simple observations: 

Observation 1 1. For any 0<i<j<n, p = prv{j) — prv{i) if and only if p 

occurs in s at position {i + l,j). 
2. If an occurrence of p ends in position j, then prv{j) > p. 

The algorithm moves two pointers L and R along the text, pointing at these 
potential positions i and j. Instead of moving linearly, however, the pointers are 
updated in jumps, alternating between updates of R and L, in such a manner that 
many positions are skipped. Moreover, because of the way we update the pointers, 
after any update it suffices to check whether R — L = \q\ to confirm that an occurrence 
has been found (cf. Lemma 2 below). 

We first need to define a function firstfit, which returns the smallest potential 
position where an occurence of a Parikh vector can end. Let p gN^, then 

FIRSTFIT (p) := min{j | prv{j) > p}, 

and set firstfit(p) = do if no such j exists. We use the following rules for updating 
the two pointers, illustrated in Fig. 1. 

Updating R: Assume that the left pointer is pointing at position L, i.e. no un- 
reported occurrence starts before L + 1. Notice that, if there is an occurrence of q 
ending at any position j > L, it must hold that prv{L) + q< prv{j). In other words, 
we must fit both prv{L) and q at position j, so we update R to 

R firstfit(pto(L) + q). 

Updating L: Assume that R has just been updated. Thus, prv{R) — prv{L) > q 
by definition of FIRSTFIT. If equality holds, then we have found an occurrence of q in 
position (L + 1, R), and L can be incremented by 1. Otherwise prv{R) — prv{L) > q, 
which implies that, interspersed between the characters that belong to q, there are 
some "superfluous" characters. Now the first position where an occurrence of q can 
start is at the beginning of a contiguous sequence of characters ending in R which all 
belong to q. In other words, we need the beginning of the longest sufiix oi s[L + l,R] 
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q" q 



Fig. 1. The situation after the update of R (above) and after the update of L (below). R is 
placed at the first fit oi prv{L) + q, thus q' is a super-Parikh vector of q. Then L is placed 
at the beginning of the longest good suffix ending in R, so q" is a sub-Parikh vector of q. 



with Parikh vector < q, i.e. the smallest position i such that prv{R) —prv(i) < q, or, 
equivalently, prv{i) > prv{R) — q. Thus we update L to 

L ^ FIRSTFIT(pr?;(i?) — q). 

Finally, in order to check whether we have found an occurrence of query q, after 
each update of R or L, we check whether R — L = \q\. In Figure 2, we give the 
pseudocode of the algorithm. 

Algorithm Jumping Algorithm 
Input: query Parikh vector q 

Output: A set Occ containing all beginning positions of occurrences of g in s 



1. set m ^ |g|; Occ ^ 0; L ^ 0; 

2. while L < n — m 

3. do <- FiRSTFiT(pri;(L) + q); 

4. a R-L = m 

5. then add L + 1 to Occ; 

6. + 

7. else L <- firstfit (prw(i?) - g); 

8. if R-L = m 

9. then add L + 1 to Occ; 

10. L-f-L + 1; 

11. return Occ; 



Fig. 2. Pseudocode of Jumping Algorithm 



It remains to see how to compute the firstfit and prv functions. We first prove 
that the algorithm is correct. For this, we will need the following lemma. 

Lemma 2. The following algorithm invariants hold: 
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1. After each update of R, we have prv{R) — prv{L) > q. 

2. After each update of L, we have prv{R) — prv{L) < q. 

3. L<R. 

Proof. 1. follows directly from the definition of FIRSTFIT and the update rule for R. 
For 2., if an occurrence was found at (i, j), then before the update we have L = i — 1 
and R = j. Now L is incremented by 1, so L = i and prv{R) — prv{L) = q — < q, 
where is the fc'th unity vector. Otherwise, L FIRSTFIT (prt;(i?) — q), and again 
the claim follows directly from the definition of FIRSTFIT. For 3., if an occurrence 
was found, then L is incremented by 1, and R — L = m — 1 > 0. Otherwise, L = 
FIRSTFIT (prz;(i?) - (?) = min{^ | prv{i) > prv{R) - q} < R. □ 

Theorem 1. Algorithm Jumping Algorithm is correct. 

Proof. We have to show that (1) if the algorithm reports an occurrence, then it is 
correct, and (2) if there is an occurrence, then the algorithm will find it. 

(1) If the algorithm reports an index i, then (z,i + m — 1) is an occurrence of q: 
An index i is added to Occ whenever R — L = m. li the last update was that of R, 
then we have prv{R) — prv{L) > qhy Lemma 2, and together with R — L = m = |q| , 
this implies prv{R) — prv(L) = q, thus {L + 1,R) = + m — 1) is an occurrence 
of q. If the last update was L, then prv{R) — prv{L) < q, and it follows analogously 
that prv(R) — prv{L) = q. 

(2) All occurrences of q are reported: Let's assume otherwise. Then there is a 

minimal i and j = i + m — \ such that p{s[i.j\) = q but i is not reported by the 
algorithm. By Observation 1, we have prv{j) — prv{i — 1) — q. 

Let's refer to the values of L and R as two sequences {Lk)k=i,2,... and {Rk)k=i,2,...- 
So we have Li ~ 0, and for all k > 1, Rk — firstfit {prv {L k) + q) , and Lk+i = + 1 
iiRk—Lk = m and Lk+i = firstfit {prv{Rk) — q) otherwise. In particular, ifc+i > L/. 
for all k. 

First observe that if for some fc, Lk = « — 1, then R will be updated to j in 
the next step, and we are done. This is because Rk = firstfit {prv (Lk) + q) = 
FiRSTFiT(pru(i — 1) + g) = FiRSTFiT(pr?;(j)) = j. Similarly, if for some k, Rk = j, 
then we have Lfc+i = i — 1. 

So there must be a fc such that Lk < i—l < Lk+i. Now look at Rk. Since there is an 
occurrence of q after Lk ending in j, this implies that Rk = FIRSTFIT {prv {Lk)+q) < j. 
However, we cannot have Rk = j, so it follows that Rk < j. On the other hand, 
i — 1 < Lk+i < Rk by our assumption and by Lemma 2. So Rk is pointing to a 
position somewhere between i—l and j, i.e. to a position within our occurrence of 
q. Denote the remaining part of q to the right of Rk by q': q' = prv{j) — prv{Rk). 
Since Rk = firstfit {prv{Lk) + q), all characters of q must fit between Lk and Rk, 
so the Parikh vector p = prv{i) — prv{Lk) is a super-Parikh vector oi q' .li p = q' , 
then there is an occurrence of q at {Lk + and by minimality of this 

occurrence was correctly identified by the algorithm. Thus, Lk+i = Lfc + 1 < i — 1, 
contradicting our choice of fc. It follows that p > q' and we have to find the longest 
good suffix of the substring ending in Rk for the next update Lk+i of L. But s[i, Rk\ 
is a good sufhx because its Parikh vector is a sub-Parikh vector of g, so Lk+i = 
¥iRSTFiT{prv{Rk) — q) <i—l, again in contradiction to ife+i > i — l. □ 
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4.2 Variant using an inverted table 

Storing all prefix vectors of s would require 0{(Tn) storage space, which may be 
too much. Instead, we construct an "inverted prefix vector table" / containing the 
increment positions of the prefix vectors: for each character a^. (E and each value 
j up to p{s)k, the position in s of the j'th occurrence of character a^. Formally, 
/[A;][j] = min{i | prv{i)k > j} for j > 1, and /[A;][0] = 0. Then we have 

firstfit(p) = max {/[A;][pfe]}. 

fe=l,...,(7 

We can also compute the prefix vectors prv{i) from table /: For k = 1, . . . ,a, 

prv{j)k = max{i | I[k]\i] < j} 

The obvious way to find these vahies is to do binary search for j in each row 
of /. However, this would take time 0(cr log n); a better way is to use information 
already acquired during the run of the algorithm. By Lemma 2, it always holds that 
L < R. Thus, for computing prv{R)k, it suffices to search for R between prv{L)k and 
prv{L)k + {R— L). This search takes time proportional to log(i?— L). Moreover, after 
each update of L, we have L > R — m, so when computing prv{L)k, we can restrict 
the search for L to between prv{R)k — m and prv{R)k, in time O(logm). For more 
details, see Section 4.4. 

Table / can be computed in one pass over s (where we take the liberty of identifying 
character ak & S with its index k). The variables Cfe count the number of occurrences 
of character ak seen so far, and are initialized to 0. 

Algorithm Construct I 

1. for z = 1 to n 

2. Cs, = c,. + 1; 

3. I[si][cs,] = i; 
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Table / requires 0{n) storage space (with constant 1). Moreover, the string s can 
be discarded, so we have zero additional storage. (Access to Sj, 1 < i < n, is still 
possible, at cost 0{a log n).) 



Example 2. Let S = {a, b, c} and s = cabcccaaabccbaacca. The prefix vectors of s are 
given below. Note that the algorithm does not actually compute these. 



pos. 




1 


2 


3 
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6 


7 


8 


9 


10 


11 


12 


13 


14 


15 


16 


17 


18 


s 
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a 


b 


c 


c 


c 


a 


a 


a 


b 


c 


c 


b 


a 


a 


c 


c 


a 


#a's 








1 


1 


1 


1 


1 


2 


3 


4 


4 


4 


4 


4 


5 


6 


6 


6 


7 


#6's 











1 


1 


1 


1 


1 


1 


1 


2 


2 


2 


3 


3 


3 


3 


3 


3 


#c's 





1 


1 


1 


2 


3 


4 


4 


4 


4 


4 


5 


6 


6 


6 


6 


7 


8 


8 



The inverted prefix table /: 








1 


2 


3 


4 


5 


6 7 8 


a 





2 


7 


8 


9 


14 


15 18 


b 





3 


10 


13 








c 





1 


4 


5 


6 


11 


12 16 17 



Query q — (3, 1, 2) has 4 occurrences, beginning in positions 5, 6, 7, 13, since (3, 1, 2) = 
prv{10) — prw(4) = prv{ll) — prw(5) = prv{12) — prv{6) = prv{18) — prv{12). The 
values of L and R are given below: 



k, see proof of Thm. 1 


1 


2 3 4 


5 


6 


7 


L 





4 5 6 


7 


10 


12 


R 


8 


10 11 12 


14 


18 


18 


occurrence found? 




yes yes yes 






yes 



4.3 Variant using a wavelet tree 

A wavelet tree on s G E* allows rank, select, and access queries in time 0{\oga). For 
a/c S S, rankk{s,i) = \{j \ sj = ak,j < i}\, the number of occurrences of character 
flfc up to and including position i, while selectk{s,i) = mm{j \ rankk{s,j) > i}, the 
position of the I'th occurrence of character a/.- When the string is clear, we just use 

rankk{i) and selectk{i). Notice that 

- prv{j) = {ranki{j), rank„{j)), and 

— for a Parikh vector p ^ (pi, ■ ■ ■ ,P<t), firstfit(p) = ma.Xk=i^,,,^„{selectk{pk)}- 

So we can use a wavelet tree of string s to implement those two functions. We give 
a brief recap of wavelet trees, and then explain how to implement the two functions 
above in 0{a) time each. 

A wavelet tree is a complete binary tree with cr = |I7| many leaves. To each inner 
node, a bitstring is associated which is defined recursively, starting from the root, in 
the following way. If = 1, then there is nothing to do (in this case, we have reached 
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a leaf). Else split the alphabet into two roughly equal parts, Sieft and i^right- Now 
construct a bitstring of length n from s by replacing each occurrence of a character 
a by if a e -S'left, and by 1 if a € bright- Let sieft be the subsequence of s consisting 
only of characters from Z'lcft , and Sright that consisting only of characters from I7right ■ 
Now rccursc on the left child with string sieft and alphabet Sidt, and on the right 
child with Sright and Z'right- An illustration is given in Fig. 4. At each inner node, in 
addition to the bitstring B, we have a data structure of size o(|-B|), which allows to 
perform rank and select queries on bit vectors in constant time ([20, 9, 21]). 

Now, using the wavelet tree of s, any rank or select operation on s takes time 
0(log a), which would yield 0{a log a) time for both prv{j) and firstfit(j3). However, 
we can implement both in a way that they need only 0{a) time: In order to compute 
rankkij), the wavelet tree, which has log cr levels, has to be descended from the root 
to leaf k. Since for prv{j), we need all values ranki{j), . . . , ranka{j) simultaneously, 
we traverse the complete tree in 0((t) time. 

For computing FiRSTFiT(p), we need maxfe{seZectfc(pfe)}, which can be computed 
bottom-up in the following way. We define a value Xu for each node u. If u is a leaf, 
then u corresponds to some character ak S set x„ = Pk- For an inner node w, let 
be the bitstring at u. We define by x^ = max{,seZecio(i?«, xica), selecti{Bu, a;right)}j 
where a;ieft and a;right are the values already computed for the left resp. right child of 
u. The desired value is equal to Xroot- 

Example 3. Let s — bbacaccabaddabccaaac (cp. Fig. 4). We demonstrate the com- 
putation of firstfit(2, 3, 2, 1) using the wavelet tree. We have firstfit(2, 3, 2, 1) 
= 'max{selecta{s,2), selectb{s,3), selectc{s,2), selectd{s,l)}, where in slight abuse of 
notation we put the character in the subscript instead of its number. Denote the 
bottom left bitstring as Ba^bi the bottom right one as -Bc,dj and the top bitstring 
as Ba^b,c,d- Then we get max{se/ecio(-Ba_f,, 2), se/ec<i(i?a,b, 3)} = max{4, 6} = 6, and 
max{5e/ec<b(-Bc,d, 2), select\{Bc^d, 1)} = niax{2,4} = 4. So at the next level, we com- 
pute max{seZecio(Ba,6,c,d,6), selecti{Ba,b,c,dA)} = niax{9, 11} = 11. 



1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 



00010110001100110001 




1 2 3 4 5 6 7 8 9 10 11 12 



1 2 3 4 5 6 7 8 



bb a a a ba a b a a a 
110001001000 



00011000 



Fig. 4. The wavelet tree for string bbacaccabaddabccaaac. For clarity, the leaves have been 
omitted. Note also that the third line at each inner node (the strings over the alphabet 
{o, b, c, d}) are only included for illustration. 
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4.4 Algorithm Analysis 

Let Ai(s, q) denote the running time of the Jumping Algorithm using inverted tables 
over a text s and a Parikh vector q, and A2(s, q) that of the Jumping Algorithm using 
a wavelet tree. Further, let J = J{s, q) be the number of iterations performed in the 
while loop in line 2, i.e., the number of jumps performed by the algorithm on the 
input q. 

The time spent in each iteration depends on how the functions firstfit and prv 
are implemented (lines 3 and 7). In the wavelet tree implementation, as we saw before, 
both take time 0{a), so the overall runtime of the algorithm is 

A2(s,g) = 0(a J). (1) 

For the inverted table implementation, it is easy to see that computing FIRSTFIT 
takes 0((t) time. Now denote, for each z = 1, . . . , J, by Li, i?j the value of L and R 
after the z'th execution of line 3 of the algorithm, respectively.^ The computation of 
prv{Li) in line 3 takes 0{(7 log m): For each fc = 1, . . . , cr, the component prv{Li)k can 
be determined by binary search over the list I[k][prv(Ri^i)k — m], I[k][prv{Ri^i)k — 
m + 1], . . . ,I{k]\prv{Ri-i)k]- By Li > Ri-i — m, the claim follows. 

The computation of prv{Ri) in line 7 takes 0{(j\og{Ri — Ri-i + m)). Simply 
observe that in the prefix ending at position Ri there can be at most Ri — Li more 
occurrences of the fc'th character than there are in the prefix ending at position 
Li. Therefore, as before, we can determine prv{Ri)k by binary search over the list 
Ilk][pry{L,)k],I[k]{prv{L,)k + I], . . . , I[k]\prv{L,)k + R^ - U]. Using the fact that 
Li > Ri-i — m, the desired bound follows. 

The last three observations imply 

Ai(s, q) = O ^o-JlogTO + c7^1og(i?j - Ri-i + m)^ . 

Notice that this is an overestimate, since line 7 is only executed if no occurrence was 
found after the current update of R (line 4). Standard algebraic manipulations using 
Jensen's inequality (see, e.g. [16]) yield X)/=ilog(i?i — Ri-i + m) < J log (j +m) . 
Therefore we obtain 

Ai(s,g) =o(aJlog(^+m)) . (2) 

Average case analysis of J The worst case running time of the Jumping Algorithm, 
in ciithcir implementation, is superlinear, since there exist strings s of any length n 
and Parikh vectors q such that J = 0{n): For instance, on the string s = ababab . . .ab 
and q = (2,0), the algorithm will execute n/2 jumps. 

^ Tlie Li and Ri coincide with the Lk and Rk from the proof of Theorem 1 almost but not 
completely: When an occurrence is found after the update of L, then the corresponding 
pair Lk,Rk is skipped here. The reason is that now we are only considering those updates 
that carry a computational cost. 
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This sharply contrasts with the experimental evaluation we present later. The 
Jumping Algorithm appears to have in practice a sublinear behavior. In the rest of 
this section we provide an average case analysis of the running time of the Jumping 
Algorithm leading to the conclusion that its expected running time is sublinear. 

We assume that the string s is given as a sequence of i.i.d. random variables 
uniformly distributed over the alphabet S. According to Knuth et al. [17] "It might be 
argued that the average case taken over random strings is of little interest, since a user 
rarely searches for a random string. However, this model is a reasonable approximation 
when we consider those pieces of text that do not contain the pattern [...]". The 
experimental results wc provide will show that this is indeed the case. 

Let us concentrate on the behaviour of the algorithm when scanning a (piece of 
the) string which does not contain a match. According to the above observation we 
can reasonably take this as a measure of the performance of the algorithm, considering 
that for any match found there is an additional step of size 1, which we can charge 
as the cost of the output. 

Let Em.a denote the expected value of the distance between R and L, following 
an update of R, i.e. if L is in position i, then we are interested in the value £ such 
that FiRSTFiT(prt;(i) + q) = i + £. Notice that the probabilistic assumptions made 
on the string, together with the assumption of absence of matches, allows us to treat 
this value as independent of the position i. We will show the following result about 
Em,a- For the sake of the clarity, we defer the proof of this technical fact to the next 
section. 



At each iteration (when there is no match) the L pointer is moved forward to 
the farthest position from R such that the Parikh vector of the substring between L 
and R is a. sub-Parikh vector of q. In particular, we can upper bound the distance 
between the new positions of L and R with m. Thus for the expected number of jumps 
performed by the algorithm, measured as the average number of times we move L, 
we have 



Recalling (1) and (2), and using (3) for a random instance we have the following 
result concerning the average case complexity of the Jumping Algorithm. 

Theorem 2. Let s G S* be fixed. Algorithm Jumping Algorithm finds all occurrences 
of a query q 



which can be constructed in a preprocessing step in time 0{n); 
2. in expected time 0(n(j^|^)^/^;^) using a wavelet tree of s of size 0{n), which 

can be computed in a preprocessing step in time 0{n). 

We conclude this section by remarking once more that the above estimate ob- 
tained by the approximating probabilistic automaton appears to be confirmed by the 
experiments. 





(3) 



1. in expected time 0(n( j^^^)"'"/^ ^ 



) using an inverted prefix table of size 0{n), 
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The proof of Lemma 3 We shall argue asymptotically with m and according to 
whether or not the Parikh vector q is balanced, and in the latter case according to 
its degree of unbalancedness, measured as the magnitude of its largest and smallest 
components. 



Case 1. q is balanced, i.e., q = (™, . . . , ™). Then, from equations (7) and (12) of [19], 
it follows that 

m2-'"( ™ ) if(T = 2, 



2ma In otherwise. 



The; author of [19] stiidicd a variant of the well known coupon collector problem 
in which the collector has to accumulate a certain number of copies of each coupon. It 
should not be hard to see that by identifying the characters with the coupon types, the 
random string with the sequence of coupons obtained, and the query Parikh vector 
with the number of copies we require for each coupon type, the expected time when 
the collection is finished is the same as our Em,a- It is easy to see that (4) provides 
the claimed bound of Lemma 3. 



Case 2. q= (gi, . . . ,qa) (^, . . . , ^). Assume, w.l.o.g., that qi > q2 > ■ ■ ■ > qa- We 
shall argue by cases according to the magnitude of qi. 



Subcase 2.1. Suppose qi = ^ + Q yy "^ a'^ j ■ us consider again the analogy with 

the coupon collector who has to collect qi copies of coupons of type i, with i = 1, . . . , cr. 
Clearly the collection is not completed until the gi'th copy of the coupon of type 1 
has been collected. We can model the collection of these type-1 coupons as a sequence 
of Bernoulli trials with probability of success 1/cr. The expected waiting time until 
the gi'th success is aqi and from the previous observation this is also a lower bound 
on Em,<j- Thus, 



Em,a > (rqi 



a ^— + J7 ^^ "^^'^^ ^ ^ = J7 + V ma In , 



which confirms the bound claimed, also in this case. 



Subcase 2.2. Finally, assume that g'l = 




. Then, for the smallest 



component qa of q we have q^ > fn — {a — l)qi = ™ — o ma In aj . Consider now 

the balanced Parikh vector q' = (qc, . . . ,qa)- We have that q' < q and \q'\ = aq^. 
By the analysis of Case 1., above, on balanced Parikh vectors, and observing that 
collecting q implies collecting q' also, it follows that 
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E > K 



= J? {aq„ + V In cr^ 

= ^fT ^— — o ^ \/mc7 In cr^^ + ^cr^ ^ — — o ^\/TOcr In ct^ ^ In cr 
= /2 — o ^crVmcr In cr^ + ^ ma In cr — o ^cr^ In ay/ma In cr^ ^ , 

in agreement with the bound claimed. This completes the proof. 

4.5 Simulations 

We implemented the Jumping Algorithm in CH — h in order to study the number of 
jumps J. We ran it on random strings of different lengths and over different alphabet 
sizes. The underlying probability model is an i.i.d. model with uniform distribution. 
We sampled random query vectors with length between logn (= logj n) and Vn, 
where n is the length of the string. Our queries were of one of two types: 

1. Quasi-balanced Parikh vectors: Of the form (^i, . . . , q„) with qi G {x — e,x + e), 
and X running from logn/ a to v^n/cr. For simplicity, we fixed e = 10 in all our 
experiments, and sampled uniformly at random from all quasi-balanced vectors 
around each x. 

2. Random Parikh vectors with fixed length m. These were sampled uniformly at 
random from the space of all Parikh vectors with length m. 

The rationale for using quasi-balanced queries is that those are clearly worst- 
case for the number of jumps J, since J depends on the shift length, which in turn 
depends on FiRSTFiT(prv(L) + q). Since we are searching in a random string with 
uniform character distribution, we can expect to have minimal FlRSTFlT(prt;(i) -|- q) 
if q is close to balanced, i.e. if all entries q^ are roughly the same. This is confirmed 
by our experimental results which show that J decreases dramatically if the queries 
are not balanced (Fig. 7, right). 

We ran experiments on random strings over different alphabet sizes, and observe 
that our average case analysis agrees well with the simulation results for random 
strings and random quasi-balanced query vectors. Plots for n = 10^ and n = 10^ with 
alphabet sizes (7 = 2,4, 16 resp. cr = 4, 16 arc shown in Fig. 6. 

In Fig. 5 we show comparisons between the running time of the Jumping algorithm 
and that of the simple window algorithm. The simulations over random strings and 
Parikh vectors of different sizes appear to perfectly agree with the guarantees provided 
by our asymptotic analyses. This is of particular importance from the point of view of 
the applications, as it shows that the complexity analysis does not hide big constants. 

To see how our algorithm behaves on non-random strings, we downloaded human 
DNA sequences from GenBank [12] and ran the Jumping Algorithm with random 
quasi-balanced queries on them. We found that the algorithm performs 2 to 10 times 
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Window algo 
Jumping algo 



1000 1500 
Query length 



Window algo 
Jumping algo 



Query length 



Fig. 5. Running time comparisons between the Jumping Algorithm and the window algo- 
rithm. The text is a random string (uniform i.i.d.) of size 9000000 from a four letter alphabet. 
Parikh vectors of different sizes between 10 and 2000 were randomly generated and the re- 
sults averaged over all queries of the same size. On the left are the results for quasi-balanced 
Parikh vectors (cf. text). On the right are the results for random Parikh vectors. 



fewer jumps on these DNA strings than on random strings of the same length, with 
the gain increasing as n increases. We show the results on a DNA sequence of 1 million 
bp (from Chromosome 11) in comparison with the average over 10 random strings of 
the same length (Fig. 7, left). 
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Fig. 6. Number of jumps for different alphabet sizes for random strings of size 100 000 (left) 
and 1000 000 (right). All queries are randomly generated quasi- balanced Parikh vectors (cf. 
text). Data averaged over 10 strings and all random queries of same length. 



5 Conclusion 

Our simulations appear to confirm that in practice the performance of the Jumping 
Algorithm is well predicted by the average case analysis we proposed. A more precise 
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Fig. 7. Number of jumps in random vs. nonrandom strings: Random strings over an alphabet 
of size 4 vs. a DNA sequence, all of length 1 000 0000, random quasi-balanced query vectors. 
Data averaged over 10 random strings and all queries with the same length (left). Comparison 
of quasi-balanced vs. arbitrary query vectors over random strings, alphabet size 4, length 
1 000 000, 10 strings. The data shown are averaged over all queries with same length m 
(right). 



analysis is needed, however. Our approach seems unhkely to lead to any refined average 
case analysis since that would imply improved results for the intricate variant of the 
coupon collector problem of [19]. 

Moreover, in order to better simulate DNA or other biological data, random string 
models other than uniform i.i.d. should also be analysed, such as first or higher order 
Markov chains. 

We remark that our wavelet tree variant of the Jumping Algorithm, which uses 
rank/select operations only, opens a new perspective on the study of Parikh vector 
matching. We have made another family of approximate pattern matching problems 
accessible to the use of self-indexing data structures [21]. We are particularly interested 
in compressed data structures which allow fast execution of rank and select operations, 
while at the same time using reduced storage space for the text. Thus, every step 
forward in this very active area can provide improvements for our problem. 
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